
ModernBERT Implementation#2518

Open
maitry63 wants to merge 4 commits into keras-team:master from
maitry63:modernbert-implementation

Conversation


@maitry63 maitry63 commented Jan 12, 2026

This PR continues work from closed PR #2256

ModernBertBackbone: Support for dynamic configurations, alternating between Global and Local (Sliding Window) attention.
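For intuition, the alternating schedule can be sketched as below. This is an illustrative assumption, not this PR's API: the `global_attn_every_n_layers` name and the every-third-layer-global cadence are taken from the reference ModernBERT design, and the real backbone configures this per layer.

```python
def attention_pattern(num_layers, global_attn_every_n_layers=3):
    # Every n-th layer attends globally; the rest use a sliding window.
    return [
        "global" if i % global_attn_every_n_layers == 0 else "local"
        for i in range(num_layers)
    ]
```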

Key Architectural Components: Implementation of RoPE (Rotary Positional Embeddings), GeGLU activation, and RMSNorm.
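A minimal NumPy sketch of the RoPE rotation, for reviewers unfamiliar with the component (illustrative only; the PR's actual layer operates on batched, multi-head query/key tensors):

```python
import numpy as np

def apply_rope(x, base=10000.0):
    # x: (seq_len, head_dim) with even head_dim. Each consecutive pair of
    # features is rotated by an angle that grows with position, so relative
    # offsets are encoded directly in query/key dot products.
    seq_len, dim = x.shape
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)
    angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because RoPE is a pure rotation, it leaves position 0 untouched and preserves vector norms at every position.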

ModernBertMaskedLM: Masked Language Model task for pre-training and fine-tuning.

ModernBertTokenizer: A dedicated tokenizer compatible with the reference implementation.

Reference

Original PR - #2256

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and works with all backends (TensorFlow, JAX, and PyTorch).
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have followed the Keras Hub Model contribution guidelines in making these changes.
  • I have followed the Keras Hub API design guidelines in making these changes.
  • I have signed the Contributor License Agreement.

@gemini-code-assist
Contributor

Summary of Changes

Hello @maitry63, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a complete implementation of the ModernBERT architecture within Keras Hub. It encompasses the core backbone with its innovative attention mechanisms and normalization techniques, a specialized masked language model for pre-training, and a custom tokenizer to process input data effectively. The changes aim to provide a state-of-the-art BERT variant with improved architectural components.

Highlights

  • ModernBertBackbone Implementation: Introduces a new backbone model supporting dynamic configurations, including alternating Global and Local (Sliding Window) attention mechanisms.
  • Advanced Architectural Components: Integrates Rotary Positional Embeddings (RoPE), GeGLU activation functions, and RMSNorm for enhanced model performance and stability.
  • Masked Language Model (MLM) Task: Provides ModernBertMaskedLM for pre-training and fine-tuning, leveraging the new backbone.
  • Dedicated Tokenizer: Includes ModernBertTokenizer, a byte-level BPE tokenizer compatible with the ModernBERT architecture and its special tokens.


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@maitry63 maitry63 mentioned this pull request Jan 12, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive implementation of the ModernBERT architecture, including the backbone, tokenizer, preprocessor, and a masked language model task. The code is well-structured and generally follows the repository's contribution guidelines. However, there are several critical issues related to model and layer implementation, serialization, and testing that need to be addressed. Key issues include incorrect model input/output definitions in the masked LM task, broken serialization in custom layers, and several bugs in the test suites that will prevent them from passing. I've provided detailed comments and suggestions to fix these issues.

Comment on lines 4 to 6
from modernbert_layers import (
    ModernBertMLP, ModernBertAttention, ModernBertEncoderLayer,
)

critical

The import from modernbert_layers import ... is incorrect. This relative import will fail when the tests are run. Please use the full import path from the project root.

Suggested change
from modernbert_layers import (
    ModernBertMLP, ModernBertAttention, ModernBertEncoderLayer,
)
from keras_hub.src.models.modernbert.modernbert_layers import (
    ModernBertMLP, ModernBertAttention, ModernBertEncoderLayer,
)

Comment on lines +8 to +11
def setUp(self):
    self.vocab = ["[CLS]", "[PAD]", "[SEP]", "air", "Ġair", "plane", "Ġat"]
    self.vocab += ["port", "[MASK]", "[UNK]"]
    self.vocab = dict([(token, i) for i, token in enumerate(self.vocab)])

critical

The vocabulary used in this test does not match the special tokens defined in ModernBertTokenizer. The test uses [CLS], [PAD], [SEP], while the tokenizer expects <|endoftext|>, <|padding|>, and <mask>. This will cause a KeyError during tokenizer initialization. Please update the test vocabulary to use the correct special tokens.

Suggested change
def setUp(self):
    self.vocab = ["[CLS]", "[PAD]", "[SEP]", "air", "Ġair", "plane", "Ġat"]
    self.vocab += ["port", "[MASK]", "[UNK]"]
    self.vocab = dict([(token, i) for i, token in enumerate(self.vocab)])
def setUp(self):
    self.vocab = ["<|endoftext|>", "<|padding|>", "<mask>", "air", "Ġair", "plane", "Ġat"]
    self.vocab += ["port", "[UNK]"]
    self.vocab = dict([(token, i) for i, token in enumerate(self.vocab)])

Comment on lines 12 to 18
self.init_kwargs = {
"vocabulary_size": 10,
"num_layers": 2,
"num_heads": 4,
"hidden_dim": 8,
"intermediate_dim": 32,
}

critical

The init_kwargs dictionary is missing the required local_attention_window argument for instantiating ModernBertBackbone. This will cause a TypeError when running the tests. Please add this argument to the dictionary.

        self.init_kwargs = {
            "vocabulary_size": 10,
            "num_layers": 2,
            "num_heads": 4,
            "hidden_dim": 8,
            "intermediate_dim": 32,
            "local_attention_window": 128,
        }

Comment on lines 48 to 54
def pack_inputs(self, inputs):
"""Pad and truncate to the target sequence length."""
return ops.pad(
inputs,
axis=-1,
constant_values=self.tokenizer.pad_token_id,
)[:, :self.sequence_length]

critical

The usage of ops.pad is incorrect. It does not accept an axis argument. You need to provide a paddings tensor that specifies the padding for each dimension. For padding to a fixed sequence length, you should calculate the required padding length and apply it to the sequence dimension.

Suggested change
def pack_inputs(self, inputs):
    """Pad and truncate to the target sequence length."""
    return ops.pad(
        inputs,
        axis=-1,
        constant_values=self.tokenizer.pad_token_id,
    )[:, :self.sequence_length]
def pack_inputs(self, inputs):
    """Pad and truncate to the target sequence length."""
    shape = ops.shape(inputs)
    pad_length = ops.maximum(0, self.sequence_length - shape[-1])
    paddings = [[0, 0] for _ in range(len(shape) - 1)] + [[0, pad_length]]
    padded_inputs = ops.pad(
        inputs,
        paddings,
        constant_values=self.tokenizer.pad_token_id,
    )
    return padded_inputs[..., : self.sequence_length]
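The suggested logic can be sanity-checked standalone with NumPy, whose np.pad uses the same per-dimension (before, after) paddings convention (a sketch of the fix's behavior, not the keras.ops code itself):

```python
import numpy as np

def pack_inputs(inputs, sequence_length, pad_token_id):
    # Pad the last axis out to sequence_length, then truncate. The paddings
    # argument is one (before, after) pair per dimension.
    pad_length = max(0, sequence_length - inputs.shape[-1])
    paddings = [(0, 0)] * (inputs.ndim - 1) + [(0, pad_length)]
    padded = np.pad(inputs, paddings, constant_values=pad_token_id)
    return padded[..., :sequence_length]
```

Note that both the pad-up and truncate-down cases are exercised: short inputs get trailing pad tokens, long inputs are clipped.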

Comment on lines 15 to 21
self.backbone = ModernBertBackbone(
    vocabulary_size=100,
    num_layers=2,
    num_heads=2,
    hidden_dim=16,
    intermediate_dim=32,
)

critical

The instantiation of ModernBertBackbone is missing the required local_attention_window argument. This will cause a TypeError when setting up the test. Please provide a value for this argument.

        self.backbone = ModernBertBackbone(
            vocabulary_size=100,
            num_layers=2,
            num_heads=2,
            hidden_dim=16,
            intermediate_dim=32,
            local_attention_window=128,
        )

Comment on lines 22 to 28
def test_serialization(self):
    layer = ModernBertEncoderLayer(
        hidden_dim=16, intermediate_dim=32, num_heads=2, local_attention_window=64
    )
    config = layer.get_config()
    new_layer = ModernBertEncoderLayer.from_config(config)
    self.assertEqual(new_layer.local_attention_window, 64)

medium

The serialization test is quite minimal. It only checks if one attribute is correctly restored after calling from_config. According to the style guide (line 412), you should use self.run_layer_test() to verify layer functionality more thoroughly, including serialization, shape inference, and training. This would provide a much more robust test.

References
  1. The style guide recommends using self.run_layer_test() for testing individual layers to ensure all core functionality is covered. (link)

Comment on lines 16 to 27
"""ModernBERT tokenizer based on Byte-Pair Encoding (BPE).

ModernBERT uses a byte-level BPE tokenizer. This class handles the
transformation of raw text into token IDs and manages special tokens
such as [PAD], [CLS], and [MASK].

Args:
vocabulary: dict or str. A dictionary mapping tokens to IDs, or a path
to a JSON file containing the vocabulary.
merges: list or str. A list of BPE merges, or a path to a merges file.
**kwargs: Standard `BytePairTokenizer` arguments.
"""

medium

The docstring is missing an Example section, which is required by the repository's style guide (line 369). Please add a usage example for ModernBertTokenizer.

References
  1. Docstrings must include comprehensive examples showing usage patterns. (link)
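As a starting point for such an example, here is a deliberately simplified sketch of the byte-level pre-tokenization convention behind the "Ġ" marker seen in the test vocabulary. It is not the real BytePairTokenizer logic, which also splits on punctuation and maps all raw bytes to printable characters:

```python
def byte_level_pretokenize(text):
    # GPT-2-style byte-level BPE marks a word's leading space with
    # "\u0120" ("Ġ"), so "air plane" splits into "air" and "Ġplane".
    words = text.split(" ")
    return [w if i == 0 else "\u0120" + w for i, w in enumerate(words)]
```

This is why the test vocabulary contains both "air" (sentence-initial) and "Ġair" (preceded by a space) as distinct tokens.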

Comment on lines 114 to 116
"""
ModernBERT Encoder Layer.
"""

medium

The docstring for ModernBertEncoderLayer is just a title. It's missing the Args and Example sections required by the repository's style guide (lines 529-530). Please provide a complete docstring that documents the layer's arguments and includes a usage example.

References
  1. Layer docstrings must document all parameters and include usage examples. (link)

    intermediate_dim,
    num_layers,
    num_heads,
    local_attention_window,

medium

The __init__ method is missing a default value for the local_attention_window argument. The docstring on line 26 specifies (default 128), but this is not reflected in the method signature. Please add the default value to the signature to match the documentation and avoid TypeError when the argument is not provided.

Suggested change
local_attention_window,
local_attention_window=128,

@maitry63 maitry63 closed this Jan 13, 2026
@maitry63 maitry63 reopened this Jan 13, 2026
@maitry63 maitry63 force-pushed the modernbert-implementation branch from 0020e26 to 89e7dcb Compare January 15, 2026 19:24
@sachinprasadhs sachinprasadhs added the new model For PRs that contribute a new model to the Keras Hub registry. label Feb 9, 2026

Labels

new model For PRs that contribute a new model to the Keras Hub registry.

3 participants